AdvGLUE

The Adversarial GLUE Benchmark

Performance of SMART_RoBERTa (single model) on AdvGLUE

Overall Statistics

Accuracy (%) on the GLUE dev sets and on AdvGLUE, by perturbation level. QQP is reported as accuracy / F1.

Task      GLUE Dev      AdvGLUE Word   AdvGLUE Sentence   AdvGLUE Human   AdvGLUE Overall
SST-2     96.6          57.0           45.8               39.9            50.9
QQP       91.2 / 88.4   63.4 / 48.0    54.1 / 19.0        73.8 / 32.0     64.2 / 44.3
QNLI      95.0          66.9           43.4               27.7            52.2
RTE       91.0          73.8           66.2               n/a             70.4
MNLI-m    90.8          49.8           39.1               n/a             45.6
MNLI-mm   90.7          51.5           33.1               22.0            36.1
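The gap between GLUE dev accuracy and overall AdvGLUE accuracy can be worked out directly from the figures above. The following is a minimal sketch, not part of the benchmark's tooling: the `scores` dictionary and the `accuracy_drop` helper are illustrative names, and the numbers are the accuracy values as read from the overall statistics (for QQP, accuracy only).

```python
# Accuracy (%) per task: (GLUE dev, AdvGLUE overall).
# Values are transcribed from the overall statistics above; QQP uses accuracy, not F1.
scores = {
    "SST-2": (96.6, 50.9),
    "QQP": (91.2, 64.2),
    "QNLI": (95.0, 52.2),
    "RTE": (91.0, 70.4),
    "MNLI-m": (90.8, 45.6),
    "MNLI-mm": (90.7, 36.1),
}


def accuracy_drop(dev: float, adv: float) -> float:
    """Absolute drop in accuracy points from the GLUE dev set to AdvGLUE."""
    return round(dev - adv, 1)


# Hypothetical summary: per-task drop in accuracy points.
drops = {task: accuracy_drop(dev, adv) for task, (dev, adv) in scores.items()}

for task, drop in drops.items():
    print(f"{task:8s} {drop:5.1f}")
```

Under these numbers, the largest drop falls on MNLI-mm and the smallest on RTE.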

Performance of SMART_RoBERTa (single model) on each task

The Stanford Sentiment Treebank (SST-2)

Adversarial accuracy (%) by attack type.

Level      Attack        Adversarial Acc
Word       Typo          56.2
Word       Knowledge     54.9
Word       Embedding     60.8
Word       Context       58.9
Word       Composition   52.8
Sentence   Syntactic     34.7
Sentence   Distraction   62.6
Human      CheckList     39.9

Quora Question Pairs (QQP)

Adversarial accuracy (%) and F1 (%) by attack type.

Level      Attack        Adversarial Acc   Adversarial F1
Word       Typo          56.0              42.1
Word       Knowledge     82.4              66.7
Word       Embedding     69.0              38.1
Word       Context       68.6              35.3
Word       Composition   62.3              52.9
Sentence   Syntactic     54.1              19.0
Human      CheckList     73.8              32.0

MultiNLI (MNLI) matched

Adversarial accuracy (%) by attack type.

Level      Attack        Adversarial Acc
Word       Typo          61.1
Word       Knowledge     50.0
Word       Embedding     57.1
Word       Context       56.3
Word       Composition   42.0
Sentence   Syntactic     36.2
Sentence   Distraction   44.4

MultiNLI (MNLI) mismatched

Adversarial accuracy (%) by attack type.

Level      Attack        Adversarial Acc
Word       Typo          35.1
Word       Knowledge     60.3
Word       Embedding     74.1
Word       Context       60.3
Word       Composition   49.2
Sentence   Syntactic     27.4
Sentence   Distraction   43.1
Human      StressTest    15.5
Human      ANLI          28.4

Question NLI (QNLI)

Adversarial accuracy (%) by attack type.

Level      Attack        Adversarial Acc
Word       Typo          70.8
Word       Knowledge     63.4
Word       Embedding     60.3
Word       Context       66.3
Word       Composition   70.5
Sentence   Syntactic     33.8
Sentence   Distraction   56.6
Human      CheckList     31.6
Human      AdvSQuAD      25.0

Recognizing Textual Entailment (RTE)

Adversarial accuracy (%) by attack type.

Level      Attack        Adversarial Acc
Word       Typo          82.6
Word       Knowledge     77.4
Word       Embedding     74.4
Word       Context       77.8
Word       Composition   63.6
Sentence   Syntactic     60.2
Sentence   Distraction   77.1